MCIC Wooster, OSU
2024-01-25
Two types of high-throughput sequencing (HTS), which sequence 10^5 to 10^9 usually randomly selected DNA fragments (reads) at a time:
https://en.wikipedia.org/wiki/DNA_sequencing#/media/File:History_of_sequencing_technology.jpg
Sequencing is performed by synthesizing a new DNA strand in part with fluorescently-labeled nucleotides (one color per base).
Visualization is not done in real time but after the fact: fragments of each possible length are produced (with fluorescent labeling only for the last base), and these can be separated by length afterwards.
The final result is a chromatogram that can be base-called:
The entire human genome was sequenced with Sanger technology! (More on that later.)
Amplification of a DNA fragment can be done through bacterial cloning or PCR — these days mostly with PCR.
This means that you need to (approximately) know in advance short flanking sequences to the sequence of interest — primers for your PCR.
Introns are good targets to sequence: variable sequences flanked by conserved sequences (exons) in which primers can be designed.
Common current applications of Sanger sequencing include:
Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)
Taxonomic identification of a sample
When are longer reads useful?
Genome assembly
Haplotyping
Transcript isoform identification
Taxonomic identification of single reads (microbial metabarcoding)
When does it not matter (as much)?
Read-as-a-tag: when we just need to know a read’s origin in a reference genome, as in counting applications such as RNA-seq
Variant analysis
What about RNA sequencing?
This lecture technically deals with DNA sequencing — however, it includes the indirect sequencing of RNA after reverse transcription to cDNA. (The direct sequencing of RNA is possible but hard and outside of the scope of this lecture.)
Currently, no sequencing technology is error-free, and several types of errors can occur:
Base call errors, e.g. a base that was called as an A may instead be a G.
Insertion or deletion (indel) errors
When the base calling software is not confident at all, it can also return Ns (= undetermined bases).
Quality scores in sequence data
When you receive sequences from a high-throughput sequencer, base calls have typically already been made. You will then receive your reads in so-called FASTQ files (more on those later) and every base in every read will be accompanied by a quality score, which is inversely related to the estimated error probability.
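The relationship between quality score and error probability can be sketched in a few lines. This assumes the common Phred convention, Q = -10 * log10(P), and the Phred+33 ASCII encoding used in most current FASTQ files. The function names and the example quality string are made up for illustration.

```python
import math

def phred_to_error_prob(q):
    """Error probability for Phred score q, assuming Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

# Hypothetical quality string for a 5-base read
scores = decode_quality_string("II5+!")
for q in scores:
    print(q, phred_to_error_prob(q))
```

Under this convention, a score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.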
Coverage
Distinguishing sequencing errors from biological variation
Random vs nonrandom errors
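As a rough illustration of coverage: the average depth is commonly estimated as the number of reads times the read length, divided by the genome size. The sketch below assumes reads are spread evenly across the genome. The function name and the example numbers are hypothetical and do not come from a real run.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average sequencing depth, assuming reads are spread evenly over the genome."""
    return n_reads * read_length / genome_size

# Hypothetical example: 400 million 150-bp reads on a 3.1-Gbp genome
depth = expected_coverage(400_000_000, 150, 3_100_000_000)
print(f"Expected average depth: {depth:.1f}x")
```

With higher depth, a random sequencing error at a given position is outvoted by the other reads covering that position, whereas true biological variation is seen consistently across reads.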
More reads, lower per-base cost, and lower error rates than long-read sequencing. The error-rate advantage is shrinking as long-read technologies keep improving (and Illumina does not).
Like Sanger, sequencing is done by synthesizing a new strand and using fluorescently labeled bases.
[TODO: Include Illumina]
[TODO: Table of machines]
Adapters etc
Single-end vs. paired-end
Includes a very small sequencer, the MinION
Continuing rapid development in technology and bioinformatics software
Whole-genome assembly (very high depth; best with a combination of long and short reads)
Variant analysis for population genetics/genomics, molecular evolution, GWAS:
Whole-genome resequencing
Reduced-representation libraries (e.g. RADseq, GBS)
Transcriptomics with RNA-seq
Other functional sequencing methods like ChIP-seq, Methyl-seq, etc
Microbial community characterization
Metabarcoding
Shotgun metagenomics
Sequence files and other genomic data files are plain-text files. We will see a couple more formats when learning about RNA-seq next week, but today we will learn about the FASTA format.
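Because FASTA is plain text, it can be read with very little code. The sketch below assumes the usual layout: a header line starting with ">", followed by one or more sequence lines. The function name and the example records are made up for illustration, and established libraries handle more edge cases.

```python
def parse_fasta(lines):
    """Return a dict mapping each record ID to its sequence.

    The ID is taken as the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(chunks) for n, chunks in records.items()}

# Hypothetical two-record FASTA content
example = """>seq1 made-up example record
ACGTACGT
NNACGT
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```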
NCBI GenBank
NCBI RefSeq
NCBI SRA
Proteins:
UniProt
Protein Data Bank (3D structures)